Abstract. Fuzzy clustering methods allow objects to belong to several
clusters simultaneously, with different degrees of membership. However, a
factor that influences the performance of fuzzy algorithms is the value of
the fuzzifier parameter. In this paper, we propose a fuzzy clustering
procedure for data (time) series that does not depend on the definition of a
fuzzifier parameter. It comes from two approaches, theoretically motivated
for unsupervised and supervised classification cases, respectively. The
first is the Probabilistic Distance (PD) clustering procedure. The second is
the well known Boosting philosophy. Our idea is to adopt a boosting
perspective for unsupervised learning problems; in particular, we address
non-hierarchical clustering problems. The aim is to assign each instance
(i.e. a series) of a data set to a cluster. We assume the representative
instance of a given cluster (i.e. the cluster center) as a target instance,
a loss function as a synthetic index of the global performance and the
probability of each instance to belong to a given cluster as the individual
contribution of a given instance to the overall solution. The global
performance of the proposed method is investigated by various experiments.

Keywords: Fuzzy Clustering, Boosting, PD clustering 

1 Introduction 

We propose a fuzzy approach for clustering data (time) series. The goal of
clustering is to discover groups so that objects within a cluster are highly
similar to one another and, at the same time, dissimilar to objects in other
clusters. Many clustering algorithms for time series have been introduced in
the literature. Since clusters can formally be seen as subsets of the data
set, one possible classification of clustering methods is according to
whether the subsets are fuzzy (soft) or crisp (hard). Let $\mathcal{D}$ be a
data set consisting of $N$ series $\{y_1, y_2, \ldots, y_N\} \subset
\mathbb{R}^n$ and let $K$ be an integer, with $2 \le K < N$; the goal is to
partition $\mathcal{D}$ into $K$ groups. Crisp clustering methods are based
on classical set theory and require that each object of the data set belongs
to exactly one cluster. This means partitioning the data $\mathcal{D}$ into
a specified number of mutually exclusive clusters $C_1, C_2, \ldots, C_K$.

A hard partition of $\mathcal{D}$ can be defined as a family of subsets
$C_k$ that satisfies the following properties (Bezdek, 1981):

$$\bigcup_{k=1}^{K} C_k = \mathcal{D},$$

$$C_k \cap C_h = \emptyset, \qquad 1 \le k \ne h \le K,$$

$$\emptyset \subset C_k \subset \mathcal{D}, \qquad 1 \le k \le K.$$

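Let $\mu_{ik}$ be the membership function and let $U = [\mu_{ik}]$ be the
$N \times K$ partition matrix. The elements of $U$ must satisfy the
following conditions:

$$\mu_{ik} \in \{0,1\}, \qquad 1 \le k \le K, \ 1 \le i \le N;$$

$$\sum_{k=1}^{K} \mu_{ik} = 1, \qquad 1 \le i \le N;$$

$$0 < \sum_{i=1}^{N} \mu_{ik} < N, \qquad 1 \le k \le K.$$

The $k$-th column of $U$ contains the values of $\mu_{ik}$ for the $k$-th
subset $C_k$ of $\mathcal{D}$.

In a hard partition, $\mu_k(y_i)$ is the indicator function:

$$\mu_k(y_i) = \begin{cases} 1, & y_i \in C_k, \\ 0, & \text{otherwise.} \end{cases}$$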

Following Bezdek (1981), the hard partitioning space is thus defined by:

$$M_c = \Big\{ U \in \mathbb{R}^{N \times K} \;\Big|\; \mu_{ik} \in \{0,1\}, \ \forall i, k; \ \sum_{k=1}^{K} \mu_{ik} = 1, \ \forall i; \ 0 < \sum_{i=1}^{N} \mu_{ik} < N, \ \forall k \Big\},$$

$M_c$ being the space of all possible hard partition matrices for
$\mathcal{D}$. Generalizing the crisp partition, $U$ is a fuzzy partition of
$\mathcal{D}$ when the elements $\mu_{ik}$ of the partition matrix take real
values in $[0,1]$ (Kaufman and Rousseeuw, 2009).

The idea of fuzzy sets was conceived by Zadeh (2009). Fuzzy clustering
methods do not assign objects to a single cluster but suggest degrees of
membership to each group. The larger the membership value of a given object
with respect to a cluster, the larger the probability of that object being
assigned to that cluster.
Similarly to the crisp conditions, Ruspini (1970) defined the following
fuzzy partition properties:

$$\mu_{ik} \in [0,1], \qquad 1 \le k \le K, \ 1 \le i \le N;$$

$$\sum_{k=1}^{K} \mu_{ik} = 1, \qquad 1 \le i \le N;$$

$$0 < \sum_{i=1}^{N} \mu_{ik} < N, \qquad 1 \le k \le K.$$

The fuzzy partitioning space is the set:

$$M_f = \Big\{ U \in \mathbb{R}^{N \times K} \;\Big|\; \mu_{ik} \in [0,1], \ \forall i, k; \ \sum_{k=1}^{K} \mu_{ik} = 1, \ \forall i; \ 0 < \sum_{i=1}^{N} \mu_{ik} < N, \ \forall k \Big\}.$$
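The hard and fuzzy partition conditions can be checked numerically. The following is a minimal sketch, assuming NumPy; the function names `is_fuzzy_partition` and `is_hard_partition` and the tolerance `tol` are illustrative choices, not part of the cited works.

```python
import numpy as np

def is_fuzzy_partition(U, tol=1e-9):
    """Check the three fuzzy-partition conditions on an N x K matrix U."""
    U = np.asarray(U, dtype=float)
    n_rows = U.shape[0]
    # memberships must lie in [0, 1]
    in_unit_interval = np.all((U >= -tol) & (U <= 1 + tol))
    # memberships of each object must sum to one over the clusters
    rows_sum_to_one = np.allclose(U.sum(axis=1), 1.0, atol=tol)
    # no cluster may be empty or absorb the whole data set
    col_sums = U.sum(axis=0)
    columns_ok = np.all((col_sums > tol) & (col_sums < n_rows - tol))
    return bool(in_unit_interval and rows_sum_to_one and columns_ok)

def is_hard_partition(U, tol=1e-9):
    """A hard partition is a fuzzy partition whose entries are all 0 or 1."""
    U = np.asarray(U, dtype=float)
    binary = np.all(np.isclose(U, 0.0, atol=tol) | np.isclose(U, 1.0, atol=tol))
    return bool(binary and is_fuzzy_partition(U, tol))
```

For instance, a matrix with rows (0.7, 0.3), (0.2, 0.8), (0.5, 0.5) passes the fuzzy check but not the hard one, which reflects the fact that every hard partition is also a fuzzy partition but not vice versa.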

Several clustering criteria have been proposed to identify fuzzy partitions
in $\mathcal{D}$. Among these proposals, the most popular method is fuzzy
c-means. Proposed by Dunn (1973) and developed by Bezdek (1981), fuzzy
c-means considers each data point as a possible member of multiple clusters
with a membership value. This algorithm is based on the minimization of the
following objective function:

$$J = \sum_{i=1}^{N} \sum_{k=1}^{K} \mu_{ik}^{m} \, \|y_i - c_k\|^2 \qquad (1)$$

s.t.

$$\mu_{ik} \in [0,1], \ \forall i, k; \qquad \sum_{k=1}^{K} \mu_{ik} = 1, \ \forall i; \qquad 0 < \sum_{i=1}^{N} \mu_{ik} < N, \ \forall k.$$

In equation (1), $m$ is any real number greater than 1, $\mu_{ik}$ is the
degree of membership of $y_i$ in cluster $k$ and $\|\cdot\|$ is any norm
expressing the similarity between any measured data and the center. The
parameter $m$ is called fuzzifier or weighting coefficient. To perform fuzzy
partitioning, the number of clusters and the weighting coefficient have to
be chosen. The procedure is carried out through an iterative optimization of
the objective function shown above, with the update of the membership values
$\mu_{ik}$ and the cluster centers $c_k$ by solving:








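$$c_k = \frac{\sum_{i=1}^{N} \mu_{ik}^{m} \, y_i}{\sum_{i=1}^{N} \mu_{ik}^{m}}, \qquad k = 1, \ldots, K, \qquad (2)$$

$$\mu_{ik} = \frac{1}{\sum_{h=1}^{K} \left( \dfrac{\|y_i - c_k\|^2}{\|y_i - c_h\|^2} \right)^{1/(m-1)}}, \qquad i = 1, \ldots, N, \ k = 1, \ldots, K. \qquad (3)$$

The loop stops when

$$\max_{i,k} \left| \mu_{ik}^{(l+1)} - \mu_{ik}^{(l)} \right| < \varepsilon,$$

where $\varepsilon$ is a small number for stopping the iterative procedure,
and $l$ indicates the iteration step. The algorithm is summarized in Box 1.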



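Box 1 Fuzzy c-means algorithm

Initialize: $K$ = number of centers, $m$ ($1 < m < \infty$), $\varepsilon$ = a small threshold.
Set the counter $l = 1$ and initialize the matrix of the fuzzy c-partitions
$U^{(0)} = [\mu_{ik}^{(0)}]$.

while $\max_{i,k} |\mu_{ik}^{(l+1)} - \mu_{ik}^{(l)}| > \varepsilon$ do

    - Calculate the cluster centers by using equation (2).

    - Update the membership matrix $U = [\mu_{ik}]$ by using equation (3) if
      $\|y_i - c_k\| > 0$ for all $k$; otherwise, if $y_i$ coincides with a
      center $c_h$, set $\mu_{ik} = 1$ if $k = h$ or set $\mu_{ik} = 0$ if
      $k \ne h$.

    - Compute $\max_{i,k} |\mu_{ik}^{(l+1)} - \mu_{ik}^{(l)}|$

    if $\max_{i,k} |\mu_{ik}^{(l+1)} - \mu_{ik}^{(l)}| > \varepsilon$ then

        - Set $l = l + 1$
    end if
end while

output: estimated centers $c_k$, membership matrix $U$.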
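The steps of Box 1 can be sketched in a few lines of Python. This is a minimal illustration, assuming NumPy and the Euclidean norm; the function name `fuzzy_c_means`, the random initialization and the default values are illustrative assumptions. Instead of the special case for a series coinciding with a center, the sketch simply clamps distances away from zero, which is a simplification.

```python
import numpy as np

def fuzzy_c_means(Y, K, m=2.0, eps=1e-6, max_iter=500, seed=0):
    """Minimal fuzzy c-means sketch for an N x n data matrix Y (Euclidean norm)."""
    rng = np.random.default_rng(seed)
    N = Y.shape[0]
    # random initial fuzzy partition: each row sums to one
    U = rng.random((N, K))
    U /= U.sum(axis=1, keepdims=True)
    for _ in range(max_iter):
        # cluster centers: membership-weighted means of the data
        Um = U ** m
        centers = (Um.T @ Y) / Um.sum(axis=0)[:, None]
        # distances of every object from every center
        dist = np.linalg.norm(Y[:, None, :] - centers[None, :, :], axis=2)
        dist = np.maximum(dist, 1e-12)  # simplification: avoid division by zero
        # membership update, written as normalized inverse powers of the distances
        inv = dist ** (-2.0 / (m - 1.0))
        U_new = inv / inv.sum(axis=1, keepdims=True)
        converged = np.max(np.abs(U_new - U)) < eps
        U = U_new
        if converged:
            break
    return centers, U
```

The membership update is algebraically the same as the ratio form of the update rule, since dividing the inverse power of one distance by the sum of the inverse powers over all centers gives the reciprocal of the sum of distance ratios. Values of `m` very close to 1 can overflow in this naive form.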


One of the limitations of fuzzy c-means clustering is its dependence on the
value of the fuzzifier $m$. A large fuzzifier value tends to mask outliers
in data sets, i.e. the larger $m$, the more clusters share their objects and
vice versa. For $m \to \infty$ all data objects have identical membership to
each cluster; for $m \to 1$, the method becomes equivalent to $k$-means. The
role of the weighting exponent has been well investigated in the literature.

Pal and Bezdek (1995) suggested taking $m \in [1.5, 2.5]$.

Dembele and Kastner (2003) obtain the fuzzifier with an empirical method,
calculating the coefficient of variation of a function of the distances
between all objects of the entire data set.

Yu et al. (2004) proposed a theoretical upper bound for $m$ that can prevent
the sample mean from being the unique optimizer of a fuzzy c-means objective
function.

Futschik and Carlisle (2005) search for a minimal fuzzifier value for which
the cluster analysis of the randomized data set produces no meaningful
results, by comparing a modified partition coefficient for different values
of both parameters.

Schwammle and Jensen (2010) showed that the optimal fuzzifier takes values
far from its frequently used value of 2. The authors introduced a method to
determine the value of the fuzzifier without using the current working data
set. For high dimensional data sets, the fuzzifier value depends directly on
the dimension of the data set and its number of objects. For low dimensional
data sets with a small number of objects, the authors reduce the search
space to find the optimal value of the fuzzifier. According to the authors,
this improvement helps in choosing the right parameter and saves
computational time when processing large data sets.

On the basis of a robust selection analysis of the algorithm, Wu (2012)
found that a large value of $m$ makes the fuzzy c-means algorithm more
robust to noise and outliers. The author suggested using a fuzzifier value
ranging between 1.5 and 4.

Since the weighting coefficient determines the fuzziness of the resulting
classification, we propose a method that is independent of the choice of the
fuzzifier. It comes from two approaches, theoretically motivated for
unsupervised and supervised classification cases respectively. The first is
the Probabilistic Distance (PD) clustering procedure defined by Ben-Israel
and Iyigun (2008). The second is the well known Boosting philosophy. From
the PD approach we took the idea of determining the probabilities of each
series belonging to any of the $K$ clusters. As this probability is
unequivocally related to the distance of each series from the centers, there
are no degrees of freedom in determining the membership matrix. From the
Boosting approach (Freund and Schapire, 1997) we took the idea of weighting
each series according to some measure of badness of fit in order to define
an unsupervised learning process based on a weighted re-sampling procedure.
As a learner for the boosting procedure we use a smoothing spline approach.
Among the smoothing spline techniques, we chose the penalized spline
approach (Eilers and Marx, 1996) because of its flexibility and
computational efficiency. This paper is organized as follows: Section 2
contains our proposal, Section 3 presents the results of some experimental
evaluation studies, and some concluding remarks are presented in Section 4.

2 Boosted-oriented probabilistic clustering of 
time series 

2.1 The key idea 

The boosting approach is based on the idea that a supervised learning
algorithm (weak learner) improves its performance by learning from its
errors (Freund and Schapire, 1997). It is an ensemble method that works with
a resampling procedure (Dietterich, 2000). The general idea is to run the
supervised learning algorithm several times, assigning to each instance of
the data set a weight that governs the resampling (with replacement) process
during the iterations. The weights are set in such a way that misclassified
instances get a larger weight than well classified instances. In this way,
the probability of being included in the sample during the iterations is
higher for those instances for which the supervised learning algorithm
returns a wrong classification. There exist boosting algorithms for both
classification and regression problems (Freund and Schapire, 1997;
Dietterich, 2000; Eibl and Pfeiffer, 2002; Gey and Poggi, 2006). In both
cases the weighting system combines a synthetic index of the performance of
the supervised learning algorithm with some index that represents the
individual contribution of a given instance to the overall solution. Our
idea is to adapt the boosting philosophy to unsupervised learning problems,
specifically to non-hierarchical cluster analysis. In such a case there is
no target variable but, as the goal is to assign each instance (i.e. a
series) to a cluster, we have a target instance. In other words, we switch
from a target variable to a target instance point of view. We take each
cluster center as a representative instance for each series and we assume as
a synthetic index of the global performance a loss function to be minimized.
The probability of each instance to belong to a given cluster is assumed to
be the individual contribution of a given instance to the overall solution.
In contrast to the boosting approach, the larger the probability of a given
series being a member of a given cluster, the larger the weight of that
series in the resampling process. As a learner, either a smoothing spline
technique or a regression model can be used. We decided to use a penalized
spline smoother because of its flexibility and computational efficiency. To
define the probabilities of each series to belong to a given cluster we use
the PD clustering approach (Ben-Israel and Iyigun, 2008). This approach
allows us to define a suitable loss function and, at the same time, to
propose a fuzzy clustering procedure that does not depend on the definition
of a fuzzifier parameter.

2.2 P-splines in a nutshell 

P-splines were introduced by Eilers and Marx (1996) as flexible smoothing
procedures combining B-splines (de Boor, 1978) and difference penalties.
Suppose we observe a set of data $\{x, y\}$, where the vector $x$ indicates
the independent variable (e.g. time) and $y$ the dependent one. We want to
describe the available measurements through an appropriate smooth function.
Denote by $B_j(x; p)$ the value of the $j$-th B-spline of degree $p$ defined
on a domain spanned by equidistant knots (in the case of unequally spaced
knots our reasoning can be generalized using divided differences). A curve
that fits the data is given by $\hat{y}(x) = \sum_{j=1}^{n} \hat{a}_j
B_j(x; p)$, where $\hat{a}_j$ (with $j = 1, \ldots, n$) are the estimated
B-spline coefficients. Unfortunately, the curve obtained by minimizing
$\|y - Ba\|^2$ w.r.t. $a$ shows more variation than is justified by the data
if a dense set of spline functions is used. To avoid this overfitting
tendency it is possible to estimate $a$ using a generous number of bases in
a penalized regression framework

$$\hat{a} = \arg\min_{a} \|y - Ba\|^2 + \lambda \|Da\|^2, \qquad (4)$$

where $D$ is a difference penalty matrix of order $d$ and $\lambda$ is a
smoothing parameter. Second or third order difference penalties are suitable
in many applications.

The optimal spline coefficients follow from (4) as:

$$\hat{a} = (B^T B + \lambda D^T D)^{-1} B^T y. \qquad (5)$$

The smoothing parameter $\lambda$ controls the trade-off between smoothness
and goodness of fit. For $\lambda \to \infty$ the final estimate tends to a
polynomial of degree one less than the order of the penalty (a constant for
a first order penalty), while for $\lambda \to 0$ the smoother tends to
interpolate the observations.
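The penalized least squares solution can be illustrated with a short Python sketch. It is a minimal example, assuming NumPy, equidistant knots and a B-spline basis built with the Cox-de Boor recursion; the names `bspline_basis` and `pspline_fit` and all default values are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def bspline_basis(x, n_segments=10, degree=3):
    """B-spline basis (degree >= 1) on equidistant knots covering [min(x), max(x)]."""
    x = np.asarray(x, dtype=float)
    lo, hi = x.min(), x.max()
    h = (hi - lo) / n_segments
    # extend the knot sequence by `degree` knots on each side of the domain
    knots = lo + h * np.arange(-degree, n_segments + degree + 1)
    # degree-0 basis: indicator of each half-open knot interval
    B = ((x[:, None] >= knots[None, :-1]) & (x[:, None] < knots[None, 1:])).astype(float)
    # Cox-de Boor recursion; on equidistant knots every denominator equals p * h
    for p in range(1, degree + 1):
        left = (x[:, None] - knots[None, :-(p + 1)]) / (p * h)
        right = (knots[None, p + 1:] - x[:, None]) / (p * h)
        B = left * B[:, :-1] + right * B[:, 1:]
    return B  # shape: (len(x), n_segments + degree)

def pspline_fit(x, y, lam=1.0, n_segments=10, degree=3, order=2):
    """Penalized B-spline fit: solve (B'B + lam * D'D) a = B'y."""
    B = bspline_basis(x, n_segments, degree)
    # difference penalty matrix of the requested order
    D = np.diff(np.eye(B.shape[1]), n=order, axis=0)
    a = np.linalg.solve(B.T @ B + lam * D.T @ D, B.T @ np.asarray(y, dtype=float))
    return B @ a, a
```

With a second order penalty, exactly linear data are reproduced for any value of `lam`, because a straight line has coefficients with zero second differences and therefore incurs no penalty.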

Popular methods for smoothing parameter selection are the Akaike Information
Criterion (AIC) and Cross Validation (CV). AIC estimates the predictive log
likelihood by correcting the log likelihood of the fitted model ($\ell$) by
its effective dimension (ED): $\mathrm{AIC} = 2\,\mathrm{ED} - 2\ell$.
Following Hastie and Tibshirani (1990) we can compute the effective
dimension as $\mathrm{ED} = \mathrm{tr}[(B^T B + \lambda D^T D)^{-1} B^T B]$
for the P-spline smoother and

$$2\ell = -2n \ln \hat{\sigma} - \sum_{i=1}^{n} \frac{(y_i - \hat{y}_i)^2}{\hat{\sigma}^2},$$

where $\hat{\sigma}$ is the maximum likelihood estimate of $\sigma$. But
$\hat{\sigma}^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 / n$, so the second
term of $\ell$ is a constant. Hence the AIC can be written as

$$\mathrm{AIC}(\lambda) = 2\,\mathrm{ED} + 2n \ln \hat{\sigma}.$$

The optimal parameter is the one that minimizes the value of
$\mathrm{AIC}(\lambda)$.
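LOO-CV chooses the value of $\lambda$ that minimizes

$$\mathrm{CV}(\lambda) = \sum_{j=1}^{n} \left( \frac{y_j - \hat{y}_j}{1 - h_{jj}} \right)^2,$$

where $h_{jj}$ is the $j$-th diagonal entry of
$H = B(B^T B + \lambda D^T D)^{-1} B^T$.
Analogous to CV is the generalized cross validation measure (Wahba, 1990)

$$\mathrm{GCV}(\lambda) = \sum_{j=1}^{n} \left( \frac{y_j - \hat{y}_j}{n - \mathrm{ED}} \right)^2,$$

where $\mathrm{ED} = \mathrm{tr}(H)$. In analogy with cross validation we
select the smoothing parameter that minimizes $\mathrm{GCV}(\lambda)$.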
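The quantities entering these criteria all derive from the hat matrix. The sketch below, assuming NumPy, computes the effective dimension together with CV and GCV scores in the sum-of-squares form for a given basis `B`, penalty matrix `D` and value `lam`; the function name `smoothing_scores` is illustrative.

```python
import numpy as np

def smoothing_scores(B, D, y, lam):
    """Effective dimension, leave-one-out CV and GCV scores for a penalized smoother."""
    G = B.T @ B + lam * D.T @ D
    H = B @ np.linalg.solve(G, B.T)   # hat matrix: y_hat = H y
    resid = y - H @ y
    ed = np.trace(H)                  # effective dimension
    n = len(y)
    cv = np.sum((resid / (1.0 - np.diag(H))) ** 2)
    gcv = np.sum((resid / (n - ed)) ** 2)
    return ed, cv, gcv
```

Taking `B` as the identity matrix gives a Whittaker-type smoother, a convenient special case for experimenting: with second order differences the effective dimension decreases from the number of observations toward 2 as `lam` grows. Forming the full hat matrix costs memory quadratic in the series length, which is the computational burden mentioned for long series.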

All these selection procedures suffer from two drawbacks: 1) they require
the computation of the effective model dimension, which can become time
consuming for long data series, and 2) they are sensitive to serial
correlation in the noise around the trend. The L-curve (Hansen, 1992) and
the derived V-curve criteria (Frasso and Eilers, 2015) overcome these
hitches. The L-curve is a parameterized curve comparing the two ingredients
of every regularization or smoothing procedure: badness of the fit and
roughness of the final estimate. For a P-spline smoother, the following
quantities can be defined

$$\{\omega(\lambda); \theta(\lambda)\} = \{\|y - B\hat{a}\|^2; \|D\hat{a}\|^2\}.$$

The L-curve is obtained by plotting $\psi(\lambda) = \log(\omega)$ against
$\phi(\lambda) = \log(\theta)$. This plot typically shows an L-shaped curve
and the optimal amount of smoothing is located in the corner of the "L" by
maximizing the local curvature measure. The V-curve criterion offers a
valuable simplification of the searching criterion by requiring the
minimization of the Euclidean distance between adjacent points on the
L-curve and, as in plots of AIC or GCV, the graphical presentation of the
V-curve has an axis for $\lambda$ that can be read off. The V-curve
criterion is computed as follows:

$$V(\lambda) = \sqrt{[\Delta \psi(\lambda)]^2 + [\Delta \phi(\lambda)]^2}, \qquad (6)$$

where $\Delta$ denotes the difference between adjacent points on a grid of
$\lambda$ values.
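A grid search based on the distance between adjacent L-curve points can be sketched as follows. This is a minimal illustration, assuming NumPy and a grid of smoothing parameters equally spaced on the log scale; the function name `v_curve` and the use of geometric midpoints to locate the minimum are illustrative assumptions, and the published criterion may differ in normalization.

```python
import numpy as np

def v_curve(B, D, y, lambdas):
    """Distance between adjacent L-curve points over a grid of smoothing parameters."""
    lambdas = np.asarray(lambdas, dtype=float)
    psi, phi = [], []
    for lam in lambdas:
        a = np.linalg.solve(B.T @ B + lam * D.T @ D, B.T @ y)
        psi.append(np.log(np.sum((y - B @ a) ** 2)))  # log badness of fit
        phi.append(np.log(np.sum((D @ a) ** 2)))      # log roughness
    # Euclidean distance between consecutive points (psi, phi) of the L-curve
    v = np.hypot(np.diff(psi), np.diff(phi))
    # each distance is attributed to the geometric midpoint of its two lambdas
    mid = np.sqrt(lambdas[:-1] * lambdas[1:])
    return mid, v, mid[np.argmin(v)]
```

Only fitted coefficients are needed at each grid value, so no hat matrix or effective dimension has to be computed.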



2.3 PD clustering approach 

Let $\mathcal{D}$ be a data set consisting of $N$ series
$\{y_1, y_2, \ldots, y_N\} \subset \mathbb{R}^n$ and let $C_k$ be a cluster,
with $k \in \{1, \ldots, K\}$, partitioning $\mathcal{D}$. We suppose that
each series has the same domain of length $n$.

Each cluster $C_k$ is associated with a cluster center $c_k$, with
$k = 1, \ldots, K$.

Let $d_{i,k} = d(y_i, c_k)$ be a distance function of the $i$-th series from
the $k$-th cluster center.

Let $p_{i,k} = p(y_i, C_k)$ be the probability of the $i$-th series
belonging to the $k$-th cluster.

For each series $y_i \in \mathcal{D}$ and each cluster $C_k$, we assume the
following relation between probabilities and distances (Ben-Israel and
Iyigun, 2008):

$$p_{i,k} \, d_{i,k} = \text{constant}. \qquad (7)$$

The constant in (7) only depends on the series $y_i$ and is independent of
the cluster $k$. Equation (7) allows us to define the membership
probabilities as (Heiser, 2004; Ben-Israel and Iyigun, 2008)

$$p_{i,k} = \frac{\prod_{h \ne k} d_{i,h}}{\sum_{l=1}^{K} \prod_{h \ne l} d_{i,h}}, \qquad (8)$$

which, for nonzero distances, is equivalent to
$p_{i,k} = d_{i,k}^{-1} / \sum_{h=1}^{K} d_{i,h}^{-1}$.
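The membership probabilities implied by the relation between probabilities and distances can be computed directly from a distance matrix. The following is a minimal sketch, assuming NumPy; the function name `pd_membership` is illustrative, and the handling of a series that coincides with a center (all probability mass on that center) is an explicit assumption consistent with the limiting behaviour of the formula.

```python
import numpy as np

def pd_membership(dist):
    """Membership probabilities from an N x K matrix of distances to the centers."""
    dist = np.asarray(dist, dtype=float)
    P = np.zeros_like(dist)
    zero = dist == 0
    on_center = zero.any(axis=1)
    # a series lying on a center puts all its mass on the coinciding center(s)
    P[on_center] = zero[on_center] / zero[on_center].sum(axis=1, keepdims=True)
    # otherwise probabilities are proportional to inverse distances
    inv = 1.0 / dist[~on_center]
    P[~on_center] = inv / inv.sum(axis=1, keepdims=True)
    return P
```

For a series at distances 1 and 3 from two centers, the probabilities are 0.75 and 0.25, so that the product of probability and distance equals 0.75 for both clusters.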
2.4 The algorithm

Since the probabilities as defined in equation (8) sum up to one among the
clusters, we use the quantity $\prod_{k=1}^{K} p_{i,k}$ as a measure of
compliance of the representation of the $i$-th series with respect to the
overall solution of the clustering procedure. It is easy to note that
$\prod_{k=1}^{K} p_{i,k} = 0$ if the $i$-th series coincides with the $k$-th
cluster center, while $\prod_{k=1}^{K} p_{i,k} = K^{-K}$ if there is maximum
uncertainty in assigning the $i$-th series to any cluster center. For this
reason we use as a measure of cluster compliance of the solution the
quantity

$$BC = \frac{1}{N} \sum_{i=1}^{N} \Big( \prod_{k=1}^{K} p_{i,k} \Big) K^{K}. \qquad (9)$$

Equation (9) is a synthetic uncertainty clustering measure: the lower its
value, the better the solution. It equals zero when there is a perfect
solution (i.e., each series has probability equal to one to belong to some
cluster). The maximum possible value of equation (9) is 1, attained when
each series has probability equal to $K^{-1}$ to belong to each of the $K$
clusters. The BC index allows us to compare overall clustering solutions
when the number $K$ of clusters differs.
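The BC index is straightforward to compute from a membership probability matrix. A minimal sketch, assuming NumPy (the function name `bc_index` is illustrative):

```python
import numpy as np

def bc_index(P):
    """BC index: mean over series of the product of memberships, scaled by K**K."""
    P = np.asarray(P, dtype=float)
    K = P.shape[1]
    return float(np.mean(np.prod(P, axis=1) * K ** K))
```

The two extreme cases described above can be verified numerically: a matrix with all entries equal to 1/K gives 1, while a crisp assignment gives 0.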

From equation (9) we define the following loss function to be minimized:

$$\sum_{i=1}^{N} \prod_{k=1}^{K} p_{i,k}. \qquad (10)$$

Let $d_{i,k} / \max_{i=1,\ldots,N} d_{i,k}$ be the contribution of the
$i$-th series to generate the $k$-th cluster.

Let $\Gamma$ be an $N \times K$ indicator matrix whose entries are 1 if
$p_{i,k} > p_{i,h}$ ($k, h = 1, \ldots, K$, $k \ne h$) and $-1$ otherwise.

We define the weight of the $i$-th series for the $k$-th cluster as

$w_{i,k} =$

For each cluster $k$, the weights are first normalized in this way:

$$\tilde{w}_{i,k} = \frac{w_{i,k}}{\sum_{h=1}^{K} w_{i,h}},$$

then within each cluster we set

$$w_{i,k} = \frac{\tilde{w}_{i,k}}{\sum_{i=1}^{N} \tilde{w}_{i,k}}. \qquad (11)$$

For each cluster $k$, a sample is extracted with replacement from
$\mathcal{D}$, taking into account equation (11). Then the cluster centers
$\hat{c}_k = B\hat{a}_k$, $k = 1, \ldots, K$, are estimated by using a
P-spline smoother. These centers are then used to compute the membership
probabilities according to equation (8) for the next iteration. The cluster
centers are re-estimated and adaptively updated with an optimal spline
smoother.
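The weighted resampling step can be sketched as follows. This is a minimal illustration, assuming NumPy and a weight matrix `W` whose columns are non-negative and sum to one; the function name `resample_centers` is hypothetical, and the default `smoother` (a pointwise mean of the sampled series) is only a stand-in for the P-spline smoother used in the procedure.

```python
import numpy as np

def resample_centers(Y, W, smoother=None, seed=0):
    """Draw one weighted bootstrap sample per cluster and estimate its center.

    Y: N x n matrix of series; W: N x K matrix whose columns sum to one.
    """
    rng = np.random.default_rng(seed)
    N, K = W.shape
    if smoother is None:
        # stand-in learner (pointwise mean), NOT a P-spline smoother
        smoother = lambda sample: sample.mean(axis=0)
    centers = []
    for k in range(K):
        # sample N series with replacement, with probabilities given by column k
        idx = rng.choice(N, size=N, replace=True, p=W[:, k])
        centers.append(smoother(Y[idx]))
    return np.vstack(centers)
```

Any callable that maps a sample of series to a single smooth curve can be passed as `smoother`, which mirrors the remark that any suitable smoother can be adopted.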

The choice of the metric depends on the nature of the series. The optimal
P-spline smoothing procedure frames our approach in the class of model-based
clustering techniques, but any suitable smoother can be adopted. Box 2 shows
the pseudo-code of our Boosted-Oriented Smoothing Spline Probabilistic
Clustering algorithm.

The procedure described in Box 2 is repeated a certain number of times
because of the sensitivity of the final solution to the random choice of the
initial cluster centers.


3 Experimental evaluation 

To evaluate the performance of the proposed algorithm, we conducted three
experiments. In estimating the optimal P-spline smoother, we always used the
V-curve criterion as in equation (6) to select the optimal $\lambda$
parameter, and we used a number of interior knots equal to
$\min(n/4; 40)$, in which $n$ is the length of the time domain, as suggested
by Ruppert (2002). Moreover, we need a measure of goodness of fuzzy
partitions. To this aim, we decided to use a fuzzy variant of the Rand index
proposed by Hüllermeier et al. (2012). This index is defined as the
complement to 1 of the normalized sum of degrees of discordance. The Rand
index developed by Rand (1971) is an external evaluation measure to compare
clustering partitions on a set of data. The problem of evaluating the
solution of a fuzzy clustering algorithm with the Rand index is that it
requires converting the soft partition into a hard one, losing information.

As shown in Campello (2007), different fuzzy partitions describing different
structures in the data may lead to the same crisp partition and hence to the
same Rand index value. For this reason the Rand index is not appropriate for
fuzzy clustering assessment.

Box 2 Boosted-oriented smoothing-spline probabilistic clustering of time
series

input: $\mathcal{D}$

initialize: maxiter = maximum number of iterations; $K$ = the number of
clusters; $d$ = a suitable distance measure; $c_k$, $k = 1, \ldots, K$
random cluster centers.

for iter = 1 : maxiter do

    - compute the $N \times K$ distance matrix $D = [d_{i,k}]$ $\forall i, k$;

    - compute the membership probabilities $P = [p_{i,k}]$ $\forall i, k$ as
      in equation (8);

    - compute equation (10);

    - assign the weights to each series for each cluster and compute the
      $N \times K$ matrix $W$ as in equation (11);

    for $k = 1 : K$ do

        - extract the sample from $\mathcal{D}$

        - compute center $\hat{c}_k = B\hat{a}_k$

    end for

    if iter = 1 then

        - $c_k = B\hat{a}_k$
    else

        for $k = 1 : K$ do

            - update cluster centers $c_k = B\hat{a}_k$,
              with $\hat{a}_k = (B^T B + \ldots$

        end for
    end if
end for

output: estimated cluster centers $\hat{c}_k$, membership probabilities
matrix $P$.


To overcome this problem Hüllermeier et al. (2012) proposed a generalization
of the Rand index for fuzzy partitions. We recall some essential background.
Let $\mathcal{P} = \{P_1, \ldots, P_K\}$ be a fuzzy partition of the data
set $\mathcal{D}$; each element $y_i \in \mathcal{D}$ is characterized by
its membership vector:

$$\mathcal{P}(y_i) = (P_1(y_i), P_2(y_i), \ldots, P_K(y_i)) \in [0,1]^K, \qquad (12)$$

where $P_k(y_i)$ is the degree of membership of the $i$-th series to the
$k$-th cluster $P_k$. Given any pair $(y_i, y_i') \in \mathcal{D}$,
Hüllermeier et al. (2012) defined a fuzzy equivalence relation on
$\mathcal{D}$ in terms of a similarity measure on the associated membership
vectors (12). Generally, this relation is of the form:

$$E_{\mathcal{P}}(y_i, y_i') = 1 - \|\mathcal{P}(y_i) - \mathcal{P}(y_i')\|,$$

where $\|\cdot\|$ represents the $L_1$ norm divided by 2, which constitutes
a proper metric on $[0,1]^K$ and yields values in $[0,1]$.
$E_{\mathcal{P}}$ is equal to 1 if and only if $y_i$ and $y_i'$ have the
same membership pattern, and decreases toward 0 as the two patterns diverge
(for crisp partitions it is equal to 0 whenever the two series lie in
different clusters). The basic idea of the authors to reach the fuzzy
extension of the Rand index was to generalize the concept of concordance in
the following way.

Given two fuzzy partitions, $\mathcal{P}$ and $\mathcal{Q}$, and considering
a pair $(y_i, y_i')$ as being concordant insofar as $\mathcal{P}$ and
$\mathcal{Q}$ agree on its degree of equivalence, they defined the degree of
concordance as

$$\mathrm{conc}(y_i, y_i') = 1 - |E_{\mathcal{P}}(y_i, y_i') - E_{\mathcal{Q}}(y_i, y_i')| \in [0,1],$$

and the degree of discordance as:

$$\mathrm{disc}(y_i, y_i') = |E_{\mathcal{P}}(y_i, y_i') - E_{\mathcal{Q}}(y_i, y_i')| \in [0,1].$$

Finally, the distance measure proposed by Hüllermeier et al. (2012) is
defined as the normalized sum of degrees of discordance:

$$d(\mathcal{P}, \mathcal{Q}) = \frac{\sum_{(y_i, y_i') \in \mathcal{D}} |E_{\mathcal{P}}(y_i, y_i') - E_{\mathcal{Q}}(y_i, y_i')|}{N(N-1)/2}.$$

The direct generalization of the Rand index corresponds to the normalized
degree of concordance and it is equal to:

$$R_E(\mathcal{P}, \mathcal{Q}) = 1 - d(\mathcal{P}, \mathcal{Q}),$$

and it reduces to the original Rand index when partitions $\mathcal{P}$ and
$\mathcal{Q}$ are non-fuzzy.
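The fuzzy Rand index just described can be computed from two membership matrices. The following is a minimal sketch, assuming NumPy and the L1 norm divided by 2 as the distance between membership vectors; the function name `fuzzy_rand_index` is illustrative.

```python
import numpy as np

def fuzzy_rand_index(P, Q):
    """Fuzzy Rand index between two membership matrices with the same N rows."""
    P = np.asarray(P, dtype=float)
    Q = np.asarray(Q, dtype=float)
    N = P.shape[0]
    # pairwise fuzzy equivalence: 1 - (L1 distance between membership vectors) / 2
    E_P = 1.0 - 0.5 * np.abs(P[:, None, :] - P[None, :, :]).sum(axis=2)
    E_Q = 1.0 - 0.5 * np.abs(Q[:, None, :] - Q[None, :, :]).sum(axis=2)
    # degrees of discordance over the N(N-1)/2 unordered pairs
    upper = np.triu_indices(N, k=1)
    disc = np.abs(E_P - E_Q)[upper]
    return float(1.0 - disc.sum() / (N * (N - 1) / 2))
```

Because only pairwise equivalences enter the computation, the index does not depend on the labelling of the clusters, and the two partitions may even have different numbers of clusters. On two crisp partitions of four objects that agree on three of the six pairs, the function returns 0.5, the classical Rand index.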


As true fuzzy partition, we always computed the true cluster centers with an
optimal P-spline smoother, and then we computed the true probabilities by
applying equation (8).

3.1 Simulated data

As a first experiment, we generated $K = 6$ clusters of numerical series at
$n = 10$ equally spaced time points in $[0,1]$ as described in Coffey et al.
(2014). Distinct cluster-specific models were used (subscript $i$ refers to
the series, subscript $j$ refers to the time domain):



where:



and $\varepsilon_{ij}$ is an autoregressive model of order 1.

Cluster means were chosen to reflect the situation where there are series
that show little variation in value over time (as given by cluster 3) and
series which have a distinct signal over time. Cluster sizes were equal to
90, 50, 100, 25, 60 and 35 for clusters 1, 2, 3, 4, 5 and 6 respectively,
giving a total number of 360 simulated series. The data set is plotted in
Figure 1.


Figure 1 about here.


Given the nature of the simulated series, we are interested in the
similarity of the shape of the series. For this reason the chosen metric was
the Penrose shape distance (Penrose, 1952), defined as:



(13)

where djj is the squared average Euclidean distance coefficient and qfj =

We performed five analyses with 100, 500, 1000, 5000 and 10000 boosting
iterations. In all cases we set 10 random starting points. Figure 2 shows
the behavior of the BC function as defined in equation (9) during the
boosting iterations. In this case the BC values appear to be non-increasing
as the number of iterations increases. The values of the BC function are
equal to 0.3615, 0.2783, 0.2643, 0.2584 and 0.2583 for 100, 500, 1000, 5000
and 10000 boosting iterations respectively.


Figure 2 about here.

All the solutions return in fact the same results in terms of estimated
centers: for example, Figure 3 shows the estimated cluster centers for each
cluster as returned by the first analysis.


Figure 3 about here.

For this data set, by using the Penrose shape distance, the Fuzzy Rand Index
is equal to 0.8599, 0.8954, 0.9059, 0.9178 and 0.9194 for the solutions with
100, 500, 1000, 5000 and 10000 boosting iterations respectively. Even though
the solutions are the same in terms of "hard" clustering, the Fuzzy Rand
Index increases with the number of iterations, indicating that the fuzzy
partitions returned by the proposed algorithm get progressively closer to
the true one. The true value of the BC index is 0.1977.

3.2 Synthetic data set 

The Synthetic.tseries data set is freely available from the TSclust
R-package (Montero and Vilar, 2014). Synthetic.tseries data consist of three
partial realizations of length $n = 200$ of six first order autoregressive
models. Figure 4 shows separately the six groups of series.


Figure 4 about here.




Subplot (a) shows an AR(1) process with moderate autocorrelation. Subplot
(b) contains series from a bi-linear process with approximately quadratic
conditional mean. Subplot (c) is formed by an exponential autoregressive
model with a more complex non-linear structure. Subplot (d) shows a
self-exciting threshold autoregressive model with a relatively strong
non-linearity.

Subplot (e) contains series generated by a general non-linear autoregressive
model and subplot (f) shows a smooth transition autoregressive model
presenting a weak non-linear structure. As we did not generate these series
ourselves, we do not report the complete simulation setting. For more
details about the generating models we refer to Montero and Vilar (2014),
p. 24.

Assuming that the aim of cluster analysis is to discover the similarity
between underlying models, the "true" cluster solution is given by the six
clusters involving the three series from the same generating model. Given
the nature of the data set considered, we use a periodogram-based distance
measure proposed by Caiado et al. (2006). It assesses the dissimilarity
between the corresponding spectral representations of the time series.

Following also the suggestion of Montero and Vilar (2014), an interesting
alternative to measure the dissimilarity between time series is the
frequency domain approach. Power spectrum analysis is concerned with the
distribution of the signal power in the frequency domain. The power spectral
density is defined as the Fourier transform of the autocorrelation function
of the $i$-th series. The autocorrelation function is a measure of
self-similarity of a signal with its delayed version. The classic method for
estimation of the power spectral density of an $n$-sample record is the
periodogram introduced by Schuster (1897). Let $y$ and $y'$ be two time
series of length $n$.

Let $f_j = 2\pi j / n$, $j = 1, \ldots, n/2$, in the range 0 to $\pi$, be
the frequencies of the series.

Let $PSD_y(f_j) = \frac{1}{n} \left| \sum_{t=1}^{n} y_t \exp(-\mathrm{i} t f_j) \right|^2$
and $PSD_{y'}(f_j) = \frac{1}{n} \left| \sum_{t=1}^{n} y'_t \exp(-\mathrm{i} t f_j) \right|^2$
be the periodograms of series $y$ and $y'$, respectively.

Finally, the dissimilarity measure between $y$ and $y'$ proposed by Caiado
et al. (2006) is defined as the Euclidean distance between periodogram
ordinates:

$$d_{y,y'} = \sqrt{\sum_{j=1}^{n/2} \left[ PSD_y(f_j) - PSD_{y'}(f_j) \right]^2}. \qquad (14)$$
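The periodogram-based dissimilarity can be sketched with the fast Fourier transform. This is a minimal illustration, assuming NumPy; the function names `periodogram` and `periodogram_distance` are illustrative, and no normalization beyond the 1/n factor of the periodogram is applied.

```python
import numpy as np

def periodogram(y):
    """Periodogram ordinates at the frequencies 2*pi*j/n, j = 1, ..., n // 2."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    # np.fft.fft indexes time from 0 rather than 1; this only changes the phase,
    # so the squared modulus is unaffected
    spec = np.abs(np.fft.fft(y)) ** 2 / n
    return spec[1 : n // 2 + 1]

def periodogram_distance(y1, y2):
    """Euclidean distance between the periodogram ordinates of two series."""
    return float(np.sqrt(np.sum((periodogram(y1) - periodogram(y2)) ** 2)))
```

Since the zero frequency is excluded, adding a constant to a series leaves its periodogram ordinates, and hence the distance, unchanged, so the measure compares the dynamics of the series rather than their levels.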


We performed our analysis by setting 800 boosting iterations and 10 random
starting points.





Table 1 shows the results of applying our algorithm to the Synthetic.tseries
data set. Each series is assigned to the estimated cluster according to the
value of the membership probability matrix. In order to obtain the Fuzzy
Rand Index, we computed the true cluster centers with a periodogram modeled
by P-spline, and then we computed the true probabilities by applying
equation (8), using the periodogram-based distance as in equation (14). The
Fuzzy Rand Index is equal to 0.9698. The solution is excellent in terms of
"hard" clustering (only one series is misclassified), and the value of the
Fuzzy Rand Index indicates that the fuzzy partition returned by the
algorithm is also really close to the true one.


Table 1 about here.

3.3 A real data example 

The ’’growth” data set is freely available from the internal repository of the 
R-package fda (Ramsay et al., 2012). This data set comes from the Berkeley 
Growth Study (Tuddenham and Snyder, 1954). Left hand side of hgure 
shows the growth curves of 93 children, 39 boys and 54 girls, starting by the 
age of one year till the age of 18. The right hand side of the same hgure 
displays the corresponding growth velocities. 


Figure 1^ about here. 

In the framework of cluster analysis this data set was mainly used for prob¬ 
lems of clustering of misaligned data (Sangalli et al, 2010a, 2010b). We 
performed two analyses with 800 boosting iterations and with 10 random 
starting point with k = 2. In the hrst partitioning analysis we used the Eu¬ 
clidean distance. The estimated centers of both the growth curves and the 
growth velocity curves are displayed respectively in the left and right hand 
side of hgure As it can be noted. Euclidean distance discriminates between 
children growing more and children growing less. This can be appreciated by 
looking at left hand side of the same hgure. On average, as expected, boys 
grow more than girls. 


Figure 1^ about here. 
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Nevertheless, Euclidean distance does not seem the right measure to be used 
in such a case. Probably researchers are interested in the shape of both 
growth and growth velocity curves during the years. For this reason, we re¬ 
peated the analysis by using the Penrose shape distance as dehned in equa¬ 
tion ( [I^ . Figure]^ shows the estimated centers for both the growth and the 
growth velocity curves. The recognized centers are really similar to the ones 
obtained by Sangalli et al. (2010a; 2010b): hrstly, as conhrmed by looking 
at tables and with respect to tables and there is a neat separation 
of boys and girls. Secondly, by looking at right hand side of hgure boys 
start to grow later but they seem to have a more pronounced growth, as it 
can be noticed by looking at the higher peak in correspondence of 15 year. 


Figure about here. 

Table |2] about here. 

Table H] about here. 

Table |4] about here. 

Table |5] about here. 

With the Euclidean distance, the Fuzzy Rand index is equal to 0.8884 and 
0.8240 for the partitions of growth and growth velocity curves respectively. 
With the Penrose shape distance, it is equal to 1.000 and 0.9246 for the 
partitions of growth and growth velocity curves respectively. 
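
As an illustration of how such an index can be computed, the following is a 
minimal sketch of a Fuzzy Rand index in the spirit of Hüllermeier et al. 
(2012). It assumes that the degree of equivalence of two objects is one 
minus half the L1 distance between their membership vectors; this 
normalization and the function names are assumptions made for the example, 
not necessarily the exact settings used in our experiments. 

```python
from itertools import combinations


def equivalence(u, v):
    """Degree to which two objects are clustered together.

    Assumed here: 1 minus half the L1 distance between membership vectors,
    which lies in [0, 1] when each vector sums to one.
    """
    return 1.0 - 0.5 * sum(abs(a - b) for a, b in zip(u, v))


def fuzzy_rand_index(P, Q):
    """Compare two fuzzy partitions given as lists of membership vectors."""
    pairs = list(combinations(range(len(P)), 2))
    # Discordance of a pair: how differently P and Q judge its equivalence
    discordance = sum(
        abs(equivalence(P[i], P[j]) - equivalence(Q[i], Q[j]))
        for i, j in pairs
    )
    return 1.0 - discordance / len(pairs)


# Hypothetical toy partitions of three objects into two clusters
crisp_p = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
crisp_q = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
same = fuzzy_rand_index(crisp_p, crisp_p)
different = fuzzy_rand_index(crisp_p, crisp_q)
```

For crisp partitions this quantity reduces to the classical Rand index, 
which is why a value of 1 indicates full agreement with the reference 
partition. 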

4 Concluding remarks 

In this paper we merged two approaches, theoretically motivated for 
unsupervised and supervised classification respectively, to propose a new 
non-hierarchical fuzzy clustering algorithm. 

From the Probabilistic Distance (PD) clustering approach (Ben-Israel and 
Iyigun, 2008) we borrowed the idea of determining the probability that each 
series belongs to each of the k clusters. As this probability is directly 
related to the distance of each series from the cluster centers, there are 
no degrees of freedom in determining the membership matrix. 
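
The link between distances and memberships can be sketched as follows. The 
example assumes the PD-clustering principle of Ben-Israel and Iyigun (2008), 
under which the product of membership probability and distance is constant 
across clusters, so that each probability is proportional to the inverse 
distance. The function name and the handling of zero distances are 
assumptions made for illustration. 

```python
def pd_memberships(distances):
    """Membership probabilities of one series from its distances to k centers.

    Assumes p_k * d_k = constant over clusters, hence p_k is proportional
    to 1 / d_k and no tuning parameter is involved.
    """
    # A series lying exactly on one or more centers is shared among them
    on_center = [d == 0 for d in distances]
    if any(on_center):
        n = sum(on_center)
        return [1.0 / n if z else 0.0 for z in on_center]
    inverse = [1.0 / d for d in distances]
    total = sum(inverse)
    return [w / total for w in inverse]


# Hypothetical distances of one series from two cluster centers
probs = pd_memberships([1.0, 3.0])
```

With these hypothetical distances the closer center receives three times 
the membership of the farther one, and the probabilities sum to one. 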

From the Boosting approach (Freund and Schapire, 1997) we borrowed the idea 
of weighting each series according to some measure of badness of fit, in 
order to define an unsupervised learning process based on a weighted 
resampling procedure. In contrast to the boosting approach, the higher the 
probability that a given instance is a member of a given cluster, the higher 
the weight of that instance in the resampling process. As a learner we can 
use any smoothing spline technique. We used a P-spline smoother (Eilers and 
Marx, 1996) because of its nice properties, and we chose the optimal spline 
parameter with the V-curve criterion as defined by Frasso and Eilers (2015). 
In this way we defined a suitable loss function and, at the same time, we 
proposed a fuzzy clustering procedure that does not depend on the definition 
of a fuzzifier parameter. 
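
A single weighted resampling step of this kind can be sketched as below. 
The sketch only illustrates drawing series with probability proportional to 
their membership in a given cluster; the function name, the toy membership 
matrix and the sample size are hypothetical, and the smoothing step that 
would estimate the cluster center from the resampled series is omitted. 

```python
import random


def resample_indices(memberships, cluster, size, rng):
    """Draw series indices with replacement, weighted by cluster membership.

    memberships: one row of membership probabilities per series.
    cluster: index of the cluster whose center is being re-estimated.
    """
    weights = [row[cluster] for row in memberships]
    return rng.choices(range(len(memberships)), weights=weights, k=size)


# Hypothetical membership matrix: three series, two clusters
memberships = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9]]
rng = random.Random(0)  # fixed seed so the draw is reproducible
sample = resample_indices(memberships, cluster=0, size=1000, rng=rng)
counts = [sample.count(i) for i in range(len(memberships))]
```

Series with a high membership in the cluster dominate the resample, which 
is the opposite of classical boosting, where poorly fitted instances 
receive more weight. 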

To evaluate the performance of our proposal, we conducted three 
experiments, one on simulated data and the remaining two on data sets known 
in the literature. The results show that our Boosted-oriented procedure 
performs well in terms of data partitioning. Even if the final fuzzy 
partition is sensitive to the choice of the distance measure, it is 
independent of any other input parameter. This consideration allows us to 
define a suitable true fuzzy partition against which to evaluate the final 
solution in terms of the Fuzzy Rand Index (Hüllermeier et al., 2012). The 
weighted resampling process allows each series to contribute to the 
composition of each cluster, and the adaptive estimation of the cluster 
centers allows the algorithm to learn from its own progress. 

It is worth noting that, as in any partitioning problem, the choice of the 
distance measure can influence the quality of the partition. 

Figures 

Figure 1: Data set generated for simulation study. 

Figure 2: BC function progress through the boosting iterations: (a) = 100 
boosting iterations; (b) = 500 boosting iterations; (c) = 1000 boosting 
iterations; (d) = 5000 boosting iterations; (e) = 10000 boosting iterations. 

Figure 3: Estimated cluster centers. 

Figure 4: Synthetic.tseries data set. 

Figure 5: Growth curves (left hand side) and growth velocity curves (right 
hand side) of 93 children from Berkeley Growth Study data. 

Figure 6: Estimated centers of growth curves (left hand side) and growth 
velocities (right hand side): Euclidean distance. 

Figure 7: Estimated centers of growth curves (left hand side) and growth 
velocities (right hand side): Penrose shape distance. 

Tables 

Table 1: Confusion matrix from clustering on Synthetic.tseries data set. 

                        Estimated Clusters 
True Clusters     C1    C2    C3    C4    C5    C6 
a                  0     0     0     0     0     3 
b                  0     1     0     2     0     0 
c                  3     0     0     0     0     0 
d                  0     3     0     0     0     0 
e                  0     0     3     0     0     0 
f                  0     0     0     0     3     0 

Table 2: Confusion matrix of growth curves with the Euclidean distance. 
Series have been assigned to the clusters according to the values of the 
membership probabilities. 

           Cluster 1    Cluster 2 
Boys           23           16 
Girls          16           38 


Table 3: Confusion matrix of growth velocity curves with the Euclidean 
distance. Series have been assigned to the clusters according to the values 
of the membership probabilities. 

           Cluster 1    Cluster 2 
Boys           31            8 
Girls           9           45 


Table 4: Confusion matrix of growth curves with the Penrose shape distance. 
Series have been assigned to the clusters according to the values of the 
membership probabilities. 

           Cluster 1    Cluster 2 
Boys            0           39 
Girls          52            2 

Table 5: Confusion matrix of growth velocity curves with the Penrose shape 
distance. Series have been assigned to the clusters according to the values 
of the membership probabilities. 

           Cluster 1    Cluster 2 
Boys           36            3 
Girls           4           49 